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Abstract 

The measurement of the efficiency of some selection or process is always an 
important part of the analysis of experimental data. The statistical tech- 
niques based on the use of the Bayes theorem which are needed to determine 
the efficiency and its uncertainty are reviewed. The problem of choosing a 
meaningful prior is addressed, and different priors are considered in real-life 
use cases. The use of the uncertainties in practical cases is also considered, 
together with the problem of combining different samples. "Pathological" 
cases are also addressed, in which non-unit weights or non-independent se- 
lections have been used to fill the histograms. The use of the family of 
Beta distributions is illustrated in the examples, making use of its conj un- 
gate property for binomial sampling (if the prior belongs to such family, the 
posterior is also a Beta distribution). Finally, several recommendations are 
made about the choice of the prior and about using and communicating the 
results. 

Key words: efficiency, Bayesian approach, reference analysis, non uniform 
priors 
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1. Introduction 

There are two cases in which we are interested into measuring an ef- 
ficiency: either we want to select the subset of all measured events that 
correspond to some criterion, or we want to measure the work done by some 
device as function of the input energy. In both cases, we take the ratio be- 
tween two homogeneous quantites, so that the efficiency is a dimensionless 
number, interpreted as a probability. 
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16 The efficiency of a selection is interpreted as the probability that any 

17 single event passes the selection, whereas the efficiency of a device (or of a 
is generic process) is interpreted as the fraction of the input energy returned as 

19 output of the device in the form of some work. If we consider the energy as a 

20 discrete quantity (indeed, it consists of a very large number of quanta), then 

21 one can equivalently say that the device efficiency is the probability that a 

22 single input quantum of energy is returned as a single output quantum of 

23 work. 

24 When speaking about a selection process, we count the initial number 

25 of events and the final number of selected events. Usually, one also knows 

26 the distribution (actually, the histogram) of some control parameters before 

27 and after the selection, and wants to determine the efficiency as function of 

28 such parameters. The case of the device efficiency can be re-conducted to 

29 the former, by noting that one usually makes repeated measurements of the 

30 output while keeping the input constant (within good precision), or fills the 

31 histograms of input and output values to find how the efficiency depends on 

32 the input. 

33 In this paper, only the discrete case of a selection process is considered 

34 and the explicit examples always deal with unique events. However, the same 

35 results also apply to the case of indivisible quanta, so that they can also be 

36 applied to any general process. The aim is to show how to estimate the effi- 

37 ciency and its uncertainty within the framework of Bayesian statistics. The 

38 frequentist approach, often preferred in high-energy physics data analysis, is 

39 not addressed here — the reader is kindly invited to find the recent review 

40 by [1] and the short comparison between the two approaches in [2], chap. 32] 

41 — for the following reasons. 

42 The experimenters most often ask about the probability that, given the 

43 data, some the true value assumes some value. This question is answered by 

44 the Bayesian approach but is ill-defined in the frequentist approach, because 

45 the true quantity (the efficiency) is viewed as a fixed unknown parameter 

46 which does not "fluctuate" , so that a probability distribution cannot be de- 

47 fined. On the other hand, in the Bayesian framework the probability state- 

48 ments always refer to some state of knowledge, so that it is possible (and 

49 indeed it is desired) to define a probability distribution for the true quantity, 

50 interpreted as a description of our degree of belief about its unknown value. 

51 In particular, the frequentist solution is often expressed as a "confidence 

52 interval" with some degree of "coverage". The two concepts are closely re- 

53 lated to the ideal repetition of the same experiment, exactly in the same 
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54 conditions, for a large number of times: a 95% confidence interval is ex- 

55 pected to contain the true (unknown) value of the parameter for (not less 

56 than) 95% of the repeated experiments, while (at most) 5% of them will miss 

57 it. 

58 In practice, repeated experiments with identical conditions are only pos- 

59 sible in computer simulations. In real cases, this long sequence of measure- 
eo ments is either (1) impossible - - because the experiment is too complex 
ei and/or too expensive, or because perfect stability in time is impossible to 

62 get — or (2) differently interpreted as a single and longer experiment — this 

63 is done every day in high-energy physics, when people collect together sev- 

64 eral runs to get higher statistics rather than interpreting them as repeated 

65 measurements — . In addition, a confidence interval cannot be used together 
ee with the best estimate (usually, the maximum-likelihood estimate or MLE) 
67 as a "two-sigma" range, because it cannot be interpreted as the width of 
es a probability distribution^ Hence, when the efficiency is needed as inter- 

69 mediate step in the computation of some physical quantity, the frequentist 

70 solution is not very friendly. 

71 In the Bayesian approach, the likelihood (i.e. the probability that an event 

72 occurs, given the model) also plays a central role. However, expecially when 

73 the input number of events is very low, one must also pay attention to the 

74 "prior" probability density function (PDF), which represents our degree of 

75 belief about the possible values of the unknown efficiency before the experi- 

76 ment is actually carried out. Indeed, the full Bayesian solution is provided by 

77 the "posterior" PDF, which is proportional to the product of the likelihood 

78 with the prior (Bayes theorem). Unless the prior is pathologic (e.g. null or 

79 negligible in the region where the true value is), the likelihood will "attract" 
so the posterior more and more, as the number of input events increases. 

si When the full posterior PDF is not desired, it can be summarized by 

82 providing its mean and some measure of its dispersion (usually the standard 

83 deviation), which can be used in the computation of the desired physical 

84 quantities in the usual way. The only caveat is that working with the stan- 

85 dard deviation implies assuming that the underlying PDF is somewhat sym- 
se metric about the mean, which in some case might be a bad approximation 



87 (section 2.2). In general, it is recommended to work with the full posterior 



lr rhis is often a source of confusion among experimenters, used to play with normal 
distributions and to interpret the results in a (unconscious) Bayesian way. 
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88 whenever it is possible. 

89 In the rest of the paper, the problem of estimating the selection efficiency 

90 and its uncertainty (a concept which only makes sense in the Bayesian frame- 

91 work) is addressed. The choice of the prior requires some care, so that rec- 

92 ommendations are made in section [3] to deal with the usual practical cases 

93 of informative and non-informative priors. Few cases in which the assump- 

94 tions of independent selections with unit weight are not both satisfied are 

95 addressed in section |4| In addition, the problem of fitting parametric models 

96 is addressed in section |4.5| Finally, appendix [A] shows useful mathematical 

97 relations. 

98 2. How to measure the efficiency and its uncertainty? 

99 Being a probability, the efficiency cannot be directly measured. Instead, 

100 we must estimate it with the available data: we can only count events, i.e. 

101 measure relative frequencies, that are rational numbers. We assume that 

102 the efficiency is a real number between zero and one, to which the relative 

103 frequency tends (in probability!) in the limit of an infinite number of mea- 

104 surements by virtue of the Bernoulli theorem. 

105 It is important to emphasize that the tendency of the measured frequency 
we to the probability has not the same behavior of a mathematical limit, for 
107 which it never happens, after some point, that the distance to the limit ex- 
los ceeds a given small quantity. Indeed, it is always possible to find, even after 
109 a large number of trials, a frequency that is not very near to the probability, 
no although the probability for this to happen decreases with increasing number 
in of trials (Bernoulli theorem). This makes measuring probabilities conceptu- 
n2 ally different from measuring physical quantities like e.g. the electric field in 
in some point, that is given by the mathematical limit of the measured force 
H4 divided by the test charge, when the latter goes to zero. However, in practice 
us this difference is not very important, and this might explain why many good 
lie books on statistical methods of data analysis omit to comment on this. 

Usually, we create and fill histograms, assuming that they converge to 
the true distributions in the limit of infinite statistics and zero bin width 
as the partial sums converge to the Riemann's integral]^] Hence, we usually 
approximate the PDF e(x; A) describing the efficiency of the selection A 



2 Again, this is not rigorously correct: the histograms converge in probability. 
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as function of the parameter(s) x, with the step function representing the 
observed relative frequencies {fi(A}}: 



e(x;A)dx*f i (A)~ 



in where ni and ki(A) count the entries in the i-th bin before and after the 
us selection A. 



119 2.1. The binomial distribution 

120 A histogram of a quantity x obtained with a series of repeated measure- 

121 ments of a; is a collection of pairs (i, n*) representing the number rij of times 

122 the measured value of x has been found in the i-th bin. The integer val- 

123 ues Hi have been observed, hence they have no uncertainty. However, if we 

124 consider the histogram as an estimate of the true distribution of x, then the 

125 nj's are estimates of the integral of the true distribution in each bin. In the 

126 assumption that the populations of all bins are statistically independent, the 

127 uncertainty <x; of the estimate n, for the true population n* of bin % is given 

128 by the Poisson distribution: <7j = yjni. 

This assumption is usually justified, but in our case the entries n, and 
ki of bin i before and after the selection are not statistically independent, 
hence one can not compute the variance of /j with the usual rules of the 
"propagation of errors" J^] Rather, the application of a selection on each bin 
can be considered a binomial process, with probability of "success" £j, the 
true (but unknown) efficiency: the probability to obtain ki events passing 
the selection when the efficiency is and the sample size is is: 

P(k t \e l ,n l )= ^( £i )**(l - £i )«<-* =1^(^,0 (1) 

129 with mean E(ki\£i,rii) = SiHi and variance V{ki\£i,ni) = £i(l — £j)nj. How- 

130 ever, this does not solve our problem, because £j is still unknown (instead, 

131 we measured n t and ki). 



3 Please, note that this is the default behavior of the bin-wise histogram division per- 
formed by ROOT 3 , unless special options are provided. 
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2.2. The usual (abused) approximation 

In practice, often people compute V(k\n, e) = ne(l — e) from ([!]) — we 
drop the index i for a while — and find the approximate variance for the 
relative frequency / = k/n by dividing it by n 2 , which has no uncertainty 
because it has been measured. Finally, the result is approximate^ by the 
substitution £—»•/: 

V (f) = V ^ = 5(1 ~ £) » = k[ ~ n - k) . (2) 

n 2 n n n 3 

Though this is a good approximation when both n, k are large and the ob- 
served frequency is not too similar to zero or one, rigorously speaking the 
formula ^ is incorrect because the binomial distribution ([T| is a function of 
ki with parameters and n» whereas in our case n, and ki are both known 
and we want to find Ef we should look instead for a function of £j, with 



138 parameters rij and ki, as it will be shown in section 2.3 In addition, this 

139 approximation suffers from the following problem: the formula (|2| gives zero 

140 uncertainty for the two limiting cases k = and k = n, independently from 
mi the actual value of n. This means that, if we have a single event (n = 1) 

142 and this survives the cut, we get the very same result (zero uncertainty) as 

143 the case k = n = 100, whereas one would expect the latter estimate to be 

144 (roughly 10 times) more precise. 

145 2. 3. The probability distribution for the true efficiency 

The correct way of finding the uncertainty associated to / involves the 
Bayes theorem |H [3] : we know ki and and that the process is binomiaj^J 
and we want to determine the probability P(ei\ki,rii) that £j is the true 
efficiency for bin i. From the Bayes theorem we find that 

P(£i\ki, Hi) = — Bi(^|£j, Hi) P(e t ) (3) 

146 (where Cj is a normalization constant) is the probability that the true ef- 

147 ficiency is between and e,i + dsi. The likelihood P(ki\ei,ni) is given 

148 by the binomial distribution Bi(/cj|£j, n») of equation Q, whereas the prior 



4 This approximation is used by ROOT when the option "B" is given to the histogram 
division. 

5 The histograms must be filled with unit weight. 
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Figure 1: Probability density function P(e\k,n) for n = 10 (left) and n = 100 (right). 



P(si\ni) = P{ei) encodes our knowledge of the true efficiency before we car- 
ried on the experiment. 

Both Paterno |1] and Ullrich &Xu j5] considered as a reasonable choice 
for the prior P{ei) a uniform distribution in [0, 1], though this is question- 
able because it does not represent a "complete ignorance" (the problem of 
choosing the prior is addressed again in section [3]). On the other hand, the 
choice of the uniform prior is interesting because it makes the posterior PDF 
proportional to the likelihood, so that the usual MLE (the frequentist solu- 
tion) coincides with the mode of Q. In this case the normalization constant 
is Ci = (rii + so that the result is (dropping again the index i) 



P(e\n, k) = (n + l)U )e*(l-e 



n 1 ! - -n-k 



k\ (n-k)\ 



e k (l-e) n ~ k (4) 



for < e < 1 or otherwise. 

Figure [l] shows the result of Q with 10 and 100 initial events, and 
allows for comparing the distributions corresponding to the same relative 
frequences. Equation Q has the form of a Beta distribution (see the ap- 



pendix A. 2) with parameters a = k + 1 and b = n — k + 1 (the uniform 



distribution is the particular case of the Beta density with a = b = 1). It 
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follows that the mean of Q is 



E(e) = I e P(e\n, k) de = ^1 , :> , 

Jo 



n + 2 

and the mode, the value at which dP/ de = 0, is 

k 



mode(e) 



n 



The usual maximum-likelihood estimator / = k/n is indeed the value which 
maximizes (|4]). Because of the shape asymmetry, this does not coincide with 
the expectation value, a biased estimator of the true efficiency The variance 
is: 

Vie) = + + (6) 

V{£) ( n + 2)2(n + 3) {) 

which has the expected behavior for k — 0,n (i.e. V{e)\k=Q, n oc 1/n 2 when 
n — > oo): 

y(e)lfc=o,n= , Jow >0 - 
[n + 2)^(n + 3) 

152 Incidentally, figure [T] is also interesting for the frequentist approach. Be- 

153 cause the only difference between the posterior Q and the binomial like- 

154 lihood is the normalization (the likelihood is not normalized), such figure 

155 shows the shape of the likelihood for different parameters and is useful to 

156 check by eye if the symmetric Gaussian approximation is well justified. A 

157 more precise comparison is shown in table [T] of section 3^ 



2.4- Quantifying the uncertainty 

When the efficiency is needed to convert the measured quantities into the 
true ones (to get e.g. the cross section), its best estimate is needed together 
with its uncertainty (whenever using the full distribution is not practical). 
The mean and variance of the posterior PDF can be used in computing 
quantities following the recipe of the "error propagation". The only caveat 
is that, in general, the posterior is asymmetric so that the trivial recipe of 
taking "±n<r" intervals around the expected result might produce intervals 
which extend into an unphysical region. The asymmetry is more pronounced 
when the efficiency is very near to one or zero, and is negligible when the 



lea Gaussian approximation of section 2.2 is good. In general, it is recommended 



169 to work with the full posterior whenever it is possible. 
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170 Because physicists are used to take the interval /j±(jj, with <jj = \/V{ei) 

171 as the Gaussian credible interval with 68.3% probability, this feature is often 

172 desired: Paterno jl] recommends using the smallest interval [a, b] C [0, 1] that 

173 contains the probability A = 68.3%, i.e. the shortest credible interval with 

174 posterior probability A (known in the statistics literature as the "highest 

175 posterior density" or HPD region), arguing that in practical applications 

176 it will behave more or less as the "±lcr" interval defined with a Gaussian 

177 standard deviation. He also provided code to compute such interval when 

178 using the uniform prior. 

179 We emphasize that this prescription is not invariant under reparametriza- 
lso tion (such credible interval is no more the HPD region after a general change 

181 of variable), which is a very desirable feature. It is recommended instead to 

182 show the 95% "reference credible intervals" [10] in the efficiency graph (see 

183 the appendix [B]), which have also approximate coverage when interpreted 

184 in the frequentist way. However please note that, depending on the actual 

185 values of the parameters n, k, such region can overcover or undercover the 

186 true value in repeated experiments, while the best practice in the frequentist 

187 approach is to choose intervals which never undercover (for more details, see 

» CP). 

189 Showing asymmetric errors is the recommended style for efficiency graphs, 

190 because the posterior PDF in general is not symmetric. The ROOT method 

191 TGraphAsymmErrors::BayesDivide() can be used to plot efficiency graphs 

192 showing Paterno's intervals (from the uniform prior) as asymmetric errors on 

193 the relative frequencies. Unfortunately, it does not support non-integer input 

194 (ROOT 5.26/00 is considered here), so that it cannot be used to plot intervals 

195 corresponding to any possible choice for the prior (most notable, it cannot be 
we used with the Jeffreys prior). In addition, Paterno's intervals are not invari- 

197 ant under reparametrization. So far, there is no available ROOT method for 

198 plotting reference credible histograms, though RooStats::BayesianCalculator 

199 can be a starting point to find central credible intervals. The latter contain 

200 the same probability on the left and on the right of the expected value and 

201 are invariant. 

202 Using the frequentist approach, the ROOT class TBinomialEfficiencyFit- 

203 |ter| can be used to fit the experimental points with a theoretical model using 

204 the maximum likelihood method (the input histograms must be filled with 

205 weights of 1). The best values are found by maximizing the sum of the bino- 

206 mial log-likelihoods defined for each bin in the input histograms (which must 

207 have the same binning) and, from the Bayesian point of view, it gives the 
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208 correct answer when the Gaussian approximation of section 2.2| is acceptable. 

209 The frequentist choice for the asymmetric confidence intervals is reported on 

210 chapter 32 of [2]. 



2ii 3. The choice of the prior 



The prior encodes our state of knowledge before the measurement is car- 
ried out. Most often, either we precisely know the prior PDF or we want 
to model a state of "perfect ignorance". The first case happens for example 
when we already know the efficiency of some device from a recent calibration 
procedure, while the second case could be the attempt of providing the most 
"objective" estimate of the efficiency. 

3.1. The Beta distribution 

If we have some knowledge before we measure the relative frequency, it 
should be encoded into a specific form for the prior PDF. One does not need 
to be very precise, because the Bayes theorem assures that the final result 
(the posterior probability) will be driven by the data, provided that we have 
enough events and that the prior is not completely wrong (i.e. the likelihood 
peak is not in the region where the prior is negligible). 

The recommended family for any prior PDF (informative or non- informative) 



is the Beta family (appendix A.2), because its natural conjungate property 
for binomial sampling of k successes among n trials brings a Beta prior 
with parameters a, b into a Beta posterior with parameters a' = a + k and 
b' = b + n — k. Simple formulae exist to express the mean, mode and variance 



of the resulting posterior distribution, as shown in appendix A.2, so that we 
can use the "method of moments" to find the Beta parameters which best 
match our prior knowledge of the value and uncertainty of the true efficiency. 
We can assume that such values are the mean E and variance V of the prior 
PDF and find the parameters a, b as 



E 



E{l-E) 



V 



[l-E) 



E{1 — E) 
V 



(7) 
(8) 



to model the prior as a Beta distribution. 

If the Beta parameters a, b are integers, one can use |TGraphAsymmEF 



rors::BayesDivide() to plot 68.3% credible regions by passing it modified 
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238 histograms filled with k{ = a, — 1 and = bi + a t + 2. Unfortunately, as 

239 said above, such method is not general enough to handle the general case of 

240 non-integer parameters for the posterior Beta density. 

241 3.2. Non-informative priors 

242 The choice of a uniform prior (a Beta distribution with a = b = 1) could 

243 seem "non-informative" and appropriate if we have no prior knowledge of 

244 P{e) and we do not have any reason to prefer any particular range for the 

245 true efficiency However, one may also think in terms of the logarithm (or 

246 some other function) of the efficiency and want a non-informative prior also 

247 for such parametrization. Unfortunately, the uniform prior is not invariant 

248 under reparametrization, so that the prior written as function of log e is no 

249 more uniform. In summary, the uniform prior is a legitimate informative 

250 prior when we have good reasons to use it, but is not a good choice when 

251 one wants to model the situation of maximum prior ignorance. 

252 The goal of the Bayesian "reference analysis" [B] is to study the impact of 

253 the choice of the prior on the posterior PDF, with respect to the case of the 

254 minimal possible prior knowledge. The "reference posterior" is the Bayesian 

255 result which minimally depends on the prior knowledge (equivalently, it max- 

256 imally depends on the likelihood), and the corresponding "reference prior" [7j 

257 is used in the Bayes theorem to model the maximal ignorance on the system 

258 before carrying out the experiment. One may view the reference posterior 

259 as the "most objective" Bayesian solution, and consider using it whenever 

260 a result should be published in a way which minimally depends on the ex- 

261 perimenter knowledge, or when it is desired to assess the dependency of the 

262 solution on the choice of the prior — the reader may find all details about 

263 the use of reference analysis as a tool for objective Bayesian inference in [8]. 

264 One of the requirements for the reference analysis is the invariance un- 

265 der reparametrization. For the binomial case, the reference prior coincides 

266 with the Jeffreys prior (a Beta distribution with a = b = 1/2, proportional 
to e -1 / 2 (l — £)~ 1 / 2 ), which was indeed found by requiring invariance under 
reparameterization [H] . In case one wants to choose a non-informative prior, 

269 either because the maximum prior ignorance is indeed the best model, or be- 
cause one wants to assess the dependence of the solution (the posterior PDF) 
on the choice of the prior, the Jeffreys prior is the recommended choice. 



267 
268 



270 
271 



272 From the relations (19) of appendix A. 2 we immediately get the following 



273 values for the expected value, mode and variance of the reference posterior, 
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274 obtained with the Jeffreys prior: 



A: +1/2 

m = ( 9 ) 

k - 1/2 

m(e) = (10) 

(fc + l/2)(n-fc + l/2) 
lJ (n + l) 2 (n + 2) 1 J 

275 (the mode is only defined for n > 1). The result is that both the mean and 

276 the mode are biased (but robust) estimators of the efficiency. 

277 A comparison with the formulae (]5]) and ^ shows that the reference 

278 posterior (obtained using the non-informative Jeffreys prior) is not much 

279 different from the posterior obtained with the uniform prior, unless n is 
relatively small. The difference with respect to the frequentist MLE is usually 
larger, and is most apparent when the measured relative frequency is one 

282 or zero (never the best estimate in the Bayesian approach), though is less 
significant for the reference posterior. 

The table [I] shows the mean (or MLE) and square root of the variance 
285 for all cases, together with the ratio between the posterior mean or variance 
obtained with the uniform prior and the corresponding quantity computed 
with the reference posterior. Clearly, the biggest discrepancy between the 

288 results obtained with the uniform and Jeffreys priors is obtained when k — 0, 

289 in which case the expected value using the uniform prior is (n + 2) _1 and 

290 the reference posterior mean is (2n + 2) _1 , increasing for higher n. The ratio 

291 between the standard deviations is larger than one at very small and large k 

292 and smaller when the efficiency is intermediate, whereas the uniform mean 

293 is larger than the reference posterior mean for small relative frequencies and 

294 smaller for large relative frequencies (the means are equal only when n = 2k). 

295 Given that the reference posterior is "more objective" and that its mean is a 

296 less biased estimator of the true efficiency, the recommendation is to use the 

297 Jeffreys prior unless there are reasons to use the (informative) uniform prior. 

298 3.3. Combining independent samples 

299 If we have two independent efficiency measurements for the same process 

300 and we want to use all available information, the correct approach to combine 

301 them is to merge the samples before and after the selection and use the results 

302 to make the final estimate (this is also true for the MLE). In the Bayesian 



280 
281 
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5 


1.000 


0.000 


0.857 


0.124 


0.917 


0.104 


0.935 


1.184 


10 





0.000 


0.000 


0.083 


0.077 


0.045 


0.060 


1.833 


1.275 


10 


1 


0.100 


0.095 


0.167 


0.103 


0.136 


0.099 


1.222 


1.043 


10 


2 


0.200 


0.126 


0.250 


0.120 


0.227 


0.121 


1.100 


0.993 


10 


3 


0.300 


0.145 


0.333 


0.131 


0.318 


0.134 


1.048 


0.972 


10 


4 


0.400 


0.155 


0.417 


0.137 


0.409 


0.142 


1.019 


0.963 


10 


5 


0.500 


0.158 


0.500 


0.139 


0.500 


0.144 


1.000 


0.961 


10 


6 


0.600 


0.155 


0.583 


0.137 


0.591 


0.142 


0.987 


0.963 


10 


7 


0.700 


0.145 


0.667 


0.131 


0.682 


0.134 


0.978 


0.972 


10 


8 


0.800 


0.126 


0.750 


0.120 


0.773 


0.121 


0.971 


0.993 


10 


9 


0.900 


0.095 


0.833 


0.103 


0.864 


0.099 


0.965 


1.043 


10 


10 


1.000 


0.000 


0.917 


0.077 


0.955 


0.060 


0.960 


1.275 



Table 1: Comparison between the posterior means and standard deviations obtained with 
the uniform and Jeffreys priors. The frequentistic MLE and its standard deviation are 
also shown. 



303 approach, the very same result is also obtained if we use the first posterior 

304 PDF to model the prior for the second estimate. Indeed, the Bayes theorem 

305 can be interpreted as a model for our learning process: it makes use of all 

306 information available at any given time, a very desirable property. 

307 To make an example, let us consider the histograms {(i, ki)} and {(i, fcQ}, 

308 filled with all events in the first sample and with the subset obtained after 

309 the selection, and the analogous histograms {(i, rij)} and {(i, n-)} filled before 
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310 and after the cut with the second sample. Here we assume that all histograms 

311 are accessible, however the result is also valid if one of them is missing and we 

312 are given the posterior PDF which summarizes the unavailable measurement. 

313 We take a single bin (omitting its index) and assume that the prior PDF 

314 for the first measurement is a Beta density with parameters a, b (in case of no 

315 prior knowledge we recommend using the Jeffreys prior with a = b = 1/2), 

316 so that the posterior of the first measurement is the Beta density Be(e; a + 

317 k',b + k — k'). This then is used as the prior for the second estimate, whose 

318 posterior PDF is the Beta density Be(e; a + k! + n', b + k — k' + n — n'). The 

319 very same posterior is obtained if we start we the initial prior Be(e; a, b) and 

320 consider the joint sample of size m = k + n from which only m! — k' + n! 

321 events survived. 

322 With the same method, we can also use simulated data to provide the 

323 prior distribution for the true efficiency, to be used in conjunction with real 

324 data in the Bayesian approach. This is mostly useful in case we need to test 

325 different kinds of "systematic" effects on the outcome of a real experiment. 

326 Given that using densities belonging to the Beta family allows to sum- 

327 marize all available information in a rather easy way, the recommended way 

328 of communicating efficiencies is to provide the values of the 4 parameters of 

329 the Beta posterior and of the Beta prior. This is equivalent to the knowl- 

330 edge of the original samples (before and after the selection) and can be used 

331 to make a combined estimate of the efficiency without the original data: it 

332 is sufficient to use the corresponding Beta distribution as prior for the new 

333 measurement. In addition, the users will be able to test the sensitivity of the 

334 result to the choice of a different prior. 

335 4. Non trivial use cases 

336 4-1- What to do with histograms not filled with unit weight? 

337 So far, we assumed that all entries of the initial histogram had unit weight 

338 and had been selected by an independent process. This may not be true in 

339 all cases, as it happens sometimes in high-energy physics. For example: 

340 1. the initial histogram {(i,Ui)} was obtained by scaling the simulated 

341 data sample to normalize it to some different value of the cross section. 

342 The histogram should not be used to make efficiency studies! Rather, 

343 the efficiency should be estimated by using the original histogram (filled 

344 with unit weights); 
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345 2. the initial histogram {(i,rii)} was obtained as the weighted average 

346 of several contributions (for example, by combining simulated samples 

347 having similar statistics but corresponding to very different cross sec- 

348 tions). As above, the histogram should not be used to make efficiency 

349 studies, which require the use of the original histogram (filled with unit 

350 weights); 

351 3. the initial histogram {(i,rii)} has been filled using weights ±1, accord- 

352 ingly to some perturbative expansion in which single terms may give 

353 a positive or negative contribution to the final production probability. 

354 This use case is explicitly addressed in section |4.2| 

355 4. the numbers of events before and after the cut have been obtained with 

356 a different procedure than simply counting events. For example, they 

357 could come from fits as in section 4.3 

358 5. the initial sample was selected by a non independent process. The case 

359 of the trigger efficiency is explicitly addressed in section 4.4 



360 Events with positive or negative unit weights 

361 In high-energy physics simulations, it might happen to work with samples 

362 filled with positive and negative unit weights (this happens for example in 

363 the output of MC@NLO [12])- Each individual event is independently simu- 

364 lated, and knows nothing about its weight. Hence we can separately consider 

365 the samples with positive and negative unit weights, with n + , n_ initial num- 

366 bers of events and k + ,k- entries after the selection. For each sample, the 

367 efficiencies e + and e_ can be computed individually following the methods 

368 already seen in previous sections: with the Jeffreys priors, their posteriors 

369 are Beta distributions with parameters <2j = ki + 1/2 and bi = — ki + 1/2, 

370 with i — +, — . 

371 However, we are interested in the overall efficiency, after subtraction of 

372 the two samples. For MC@NLO, its authors say that the efficiency should be 

373 estimated as / = (k + — kJ)/(n + — nJ) when k + > k_ or zero otherwise, and 

374 they suggest to use the usual "propagation of errors" to estimate its variance 

375 whenever the numbers are high enough that the Gaussian approximation 

376 holds, or to run many MC samples through the cuts and look at the dispersion 

377 in the result if the data sample is too small [T3] . 

Here we use instead the method of moments to find the parameters of the 
posterior Beta distribution that matches the approximate mean E(e) ~ / 
and its approximate variance. One may write / = (k + — kJ)/{n + — n_) = 
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(f+ n +~ f- n +)l { n +~ n -)i which is our estimate for (e + n + — £_n + )/(ri + — nJ). 
The latter has variance 



S78 
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381 



386 
387 



= niV( g+ )-nly (£ -) 
In 4- — n_ r 



where V(e + ), V(e_) are computed with the individual samples of homoge- 
379 neous weights. Using equations ^ and ([7]) one can finally find the posterior 
Beta density which gives the desired result. This is the recommended ap- 
proach, given that the other method suggested below is much more complex. 

382 Please note that that the method above can be easily generalized to be 

383 used in case we are told to consider / = (£\ Wiki)/(^2 i WiJij) as the best 

384 estimate of the efficiency from a mixture of samples with weights Wi, provided 

385 that we know all pairs (rij, &») of events before and after the selection for each 
homogeneous sample. 

Another possibility is to use the general results obtained by by Pham- 

388 Gia et al. [H], who found a (rather complex) analytical expressions for the 

389 case in which one makes the difference between two independent random 

390 variables, each one following a Beta distribution. The relevant properties 



391 of their "beta-difference" distribution are summarized in appendix A. 4 In 

392 general, the difference has domain in [—1, +1], hence in our case the posterior 

393 needs to be set to zero for negative values and re-normalized. A numerical 



394 approach needs to be used to handle the resulting posterior, equation (27). 

395 Also, note that uniform priors were used in the computation, which makes 

396 their solution appropriate only when such choice does not make troubles. 

397 Jf.,3. Events estimated by fits 

398 Let us consider the case in which we have a distribution of N events 

399 which is a mixture of "signal" (S) and "background" (B) events, and we 

400 want to know the effect of some cut on the two components. We obtain 

401 separate estimates n s , rib for the initial numbers of S and B events by fitting 

402 the distribution with a function which is a mixture of two PDFs with relative 

403 weights x s and Xf, — (1 — x s ), so that n s = x s N and rib = (1 — x s )N may 

404 be non-integer positive numbers with the constraint n s + rib = N. Later, we 

405 apply the cut under study and fit the resulting distribution, containing K 

406 events, to obtain the estimates k s , kb of the final numbers of S and B events. 

407 Again, the fits provides the relative weight y s for the S component, so that 

408 k s = y s K and kb — (1 — y s )K, with k s + k b = K. We are interested into the 
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409 efficiency of the cut for signal events, which would be naively expected to be 

410 k s /n s , though this is not the correct guess, as shown below. 

411 A possible way of approaching the problem is to consider the following 

412 distinct phases: 

413 1. The first fit separates S from B events from the initial distribution and 

414 it is assumed to provide the best estimates of n s and its RMS r s . This 

415 can be considered a binomial process in which we select n s = x s N 

416 signal candidates out of N, so that the result can be cast with the 

417 method of moments in the approximate form of a Beta density whose 

418 mean is x s and whose variance a 2 s is also given by the fit: a s = r s /N. 

419 This means that we consider the weight x s returned by the fit as the 

420 posterior estimate of the efficiency of the initial signal selection. The 

421 obtained beta density Be(x; a, b) will be used as prior PDF for the next 

422 step. Incidentally, we note that the non-integer nature of n s and k s is 

423 not a problem here, because the posterior Beta density is not required 

424 to have integer parameters. 

425 2. We are interested into the cut efficiency for signal candidates, that is 

426 the probability that an event both passes the cut and is assigned to the 

427 S population by the second fit. We can either consider this a unique 

428 selection process or split it into P(cut, fit) = P(fit|cut)P(cut). The 

429 latter case implies one more iteration of the Bayes theorem, but here 

430 we consider the unique selection because it is simpler and we are not 

431 interested into the details of the second fit. 



432 



From the discussion above, we know that the posterior PDF describing the 

433 efficiency of our cut for signal candidates is a Beta density with parameters 

434 a' = k s + a and b' = n s — k s + b, where a, b have been determined with 

435 the method of moments from the first fit. Hence, the expected efficiency for 



437 



440 
441 
442 
443 



436 our selection is given by equation (19) from appendix A. 2: a' /(a' + b' 



(k s + a)/ (n s + a + b), which is different from the naive value k s /n s , in that it 

438 explicitely takes into account the effects of the fitting procedure, i.e. of the 

439 classification in "signal" and "background" events. 
Because in general a ^ 1/2 and b ^ 1/2, the final Beta density is differ- 
ent from what is obtained from simple event counting using Jeffreys prior. 
Depending on the fit properties (expecially on the first fit), the final PDF 
may be wider than the reference posterior, most notably when the number 

444 of events is small. The fitting procedure has some cost: intuitively, a fit 

445 which better discriminates between signal and background events will pro- 
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B <- 



Figure 2: Non-independent selection. The sample [o] only exists 
with simulated data, that can be used to make a first estimate of the 
efficiency of selection A by taking the bin-wise histogram division 
[Y]/|~0]. With real data, the least biased sample [jT] is obtained by 
selecting events with B. Next, imposing condition A on [2j one gets 
the sample [§]. 



vide a narrower density, whereas a poor fit will end up into a wider one. The 
good point of the Bayesian treatment is that it accounts for all the available 
information, not that it produces narrower distributions. 

4-4- What to do if the samples are not independent? 

The case in which the initial histogram {(i,rii)} does not represent a 
451 statistically independent sample is expecially important in trigger efficiency 
measurements, when there is no other trigger selection which is statistically 
uncorrelated with respect to the signature A under study. Figure [2] shows 
454 a situation in which one wants to study the systematic effect of a previous 



450 



452 
453 



458 



460 



471 
472 
473 



455 trigger selection B on A starting with the true distribution |_0_|, on MC data. 

456 As we have seen, the best estimate of the efficiency of A as function of some 

457 quantity x is given by the histogram ratio between the distribution of x after 



the selection (the histogram filled with sample 1 ) and its distribution before 



459 (sample |_0_[). Real data can only be taken with trigger B (which in practice 



is chosen to be the least correlated as possible to A), obtaining sample 2 



461 Later, condition A can be required on [2J obtaining sample [3J, which has 

462 been selected by requiring both B and A, and the histogram division 3/2 

463 estimates the probability P{A\B) to select one event with A, given that it 

464 was already selected by B. 

465 In order to find the desired true efficiency P{A), we make use of the 

466 relation defining the conditional probability, P(A ■ B) = P(A\B)P(B) = 

467 P(B\A)P(A), obtaining P(A) = P(A\B)[P(B)/P(B\A)], where the fraction 

468 in brackets cannot be determined with real data alone. The true efficiency of 

469 B alone is found with simulated data by requiring condition B on sample _0_, 

470 whereas the relative efficiency P{B\A) of B with respect to A is obtained by 
requiring condition B on sample [T]. The conclusion is that, without some 
statistically independent trigger, one can not estimate the trigger efficiency 
using real data only. Rather, a simulation is required to measure the impact 

474 of the non-independent trigger B on the selection A under study. 
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Composite model: Landau + Gaussian pedestal | Threshold function: erf with center at 6 a.u. and o = 0.2 a.u. 



1 




10 20 30 40 50 2 4 6 8 10 



energy (a.u.) E (a.u.) 

Figure 3: Probability density function followed by the simulated sample (left) and true 
threshold efficiency (right) as function of the energy (in arbitrary units). The Landau 
distribution has parameter = 8.0 and scale = 1.0 whereas the pedestal Gaussian peaks 
at 5.0 a.u. with 0.2 a.u. standard deviation. The threshold (right) is modeled as an error 
function with center at 6.0 a.u. and standard deviation of 0.2 a.u. 



475 If the approximation in which A and B are independent is good enough, 

476 the value of the fraction can be considered equal to one in all bins, and 



479 
480 



477 the bin-wise ratio between |_3_| and [_2j gives a good estimate of the true 

478 



A efficiency. Such approximation is justified if the (systematic) effect of 
P(B)/P(B\A) is small compared to the statistical uncertainty on the ratio 



between |_3j and |_2j (which might not be true in all bins), i.e. when the square 

481 root of the variance is significantly larger than 1 — P(B)/P(B\A). Otherwise, 

482 the recommended approach is to model the knowledge about such ratio with 

483 a Beta density in each bin, and obtain the Beta posterior for the efficiency 

484 of the selection A as explained before. 

485 Jf..5. Fitting examples 

486 Here we consider a "real life" example: a simulation of a common experi- 

487 mental setup, the measurement of the selection efficiency versus the threshold 

488 on a scalar quantity, and the fit of the resulting "turn-on" plot in ROOT. 

489 Because no Bayesian fitting technique is yet available in ROOT0 a prag- 

490 matic approach is to test the different fitting options available in the release 

491 (5.26/00) used for this work. 

492 The model describes the energy lost by minimum ionizing particles (M IPs) 

493 while crossing a thin slab of active material (e.g. a scintillator) as a Landau 



6 Work is in progress on this side. Lorenzo Moneta, private communication. 
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Figure 4: Left: full spectrum, before and after the cut (the threshold position is indi- 
cated by the dashed line). Right: fits done with binomial errors (top-left), the square 
root of the reference variance |TT| ) (top- right), HPD credible intervals (bottom-right), and 
TBinomialEfficiencyFitter (bottom- left) . 



494 distribution. It is assumed that the read-out electronics (e.g. a photomul- 

495 tiplier tube read by a charge integrator) is tuned to have a dynamic range 

496 large enough that the peak of the MIP energy distribution is not very distant 

497 from the pedestal (figure [3j left panel; the energy is in arbitrary units, e.g. 

498 ADC counts). The experimental setup is triggered by a comparator whose 

499 threshold is somewhere in between the two peaks. Due to the electronic jit- 

500 ter, the comparator does not apply a sharp cut on the distribution. Rather, 

501 a smooth "turn-on" function (figure [3j right panel) is obtained, due to the 

502 random fluctuations on the difference between the measured energy and the 

503 threshold. The ideal (and quite common) case of Gaussian fluctuations has 

504 been simulated, so that the turn-on curve is an "error function" (the Gaussian 

505 integral from minus infinity to the considered value). 

506 A total of 1 million events has been simulated, with energy following the 

507 distribution shown in the left plot of figure [3j Each event has been "rejected" 

508 accordingly to the true threshold function shown in the right plot of the same 

509 figure^] Later, the best estimate of the efficiency has been obtained by taking 

510 the bin-wise ratio of the energy histograms filled for the events passing the 

511 threshold and for all events. 

Figure [4] shows the full sample, before and after the cut, and four different 



7 The event was rejected when the value returned by a uniform generator was higher 
than the threshold function computed at the event energy. Both energy and the decision 
have been saved on file for later use. 
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fits performed with the following function: 



g{x) = b + - 1 + erf 



x — t 



V2w 



) 



(13) 
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where t is the best estimate of the threshold position, w the threshold width 
(Gaussian standard deviation), p is the plateau value reached at high energy 
(left fixed at 1 in the fits) and b is the lowest efficiency (i.e. the "offset" of 
the "turn-on" curve, left fixed at zero). 

As expected with so many events, the fit results using the Gaussian ap- 
proximation (ROOT option "B" of THl::Divide()), the standard deviations 
computed with the Jeffreys prior, the asymmetric credible intervals recom- 
mended by Paterno, or TBinomialEfficiencyFitter agree with each other. The 
latter method underestimates the threshold jitter obtaining a result which is 
not compatible with the true value (0.2 a.u.) within the quoted uncertainty, 
whereas the values from the other fits are at about 2 standard deviations 
from the true model. Apart from TBinomialEfficiencyFitter, which makes 
use of the two histograms filled before and after the cut, the other fits use 
the value of the ratio in each bin and the uncertainty which we have assigned 
following different methods. The latter fits have been made with the option 
"ME" of THl::Fit() to select error estimation using the Minos technique and 
the improved fit results by TMinuit. Another possible approach is to use 
the option "LL" to use the log-likelihood method instead of the chi-square 
method, when the bin contents are not integer values (such option is not sup- 
ported by TGraph::FitQ). In this case, the estimated errors are much larger 
than with "ME" (also, the reported quality of the fit is worse) but again the 
results are compatible with the true model. 

Next, the sample has been divided into 100 subsets of 100, 1000, and 
10000 events each, and repeated turn-on fits have been attempted with all 
methods. The values of the threshold and width have been histogrammed 
(only when the fit was successful) as shown in the figures [5] and |6j in which 
the fit options "ME" and "LL" are compared for the "binomial errors" (option 
"B" for the histogram division) and the reference standard deviations, and 
in figure [7j showing the results by TBinomialEfficiencyFitter and by the chi- 
square fit of the graph showing Paterno's HPD credible intervals. 

These figures show that the log-likelihood fit obtains results which are 
more closely clustered around the true values when using the Gaussian ap- 
proximation for the errors (figure [5]) . As noticed above, this approximation 
is very bad when one has most points at zero or full efficiency, because it 
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Figure 5: Distribution of the fit parameters obtained with the binomial errors. Chi-square 
(left) and log-likelihood (right) fitting methods are compared. 




Figure 6: Distribution of the fit parameters obtained with the square root of the reference 



variance (11). Chi-square (left) and log-likelihood (right) fitting methods are compared. 




Figure 7: Distribution of the fit parameters obtained with TBinomialEfficiencyFitter (left) 
and HPD credible intervals (right). 
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546 assigns zero uncertainties to such cases. This problem is visible also on fig- 

547 ure |4j in which the fit with binomial errors has a very low number of degrees 

548 of freedom, compared to the other three cases: bins with zero uncertainty are 

549 not counted as degrees of freedom. When using the option "LL" the number 

550 of degrees of freedom increases to 148, so that this is preferrable when using 

551 the approximation of binomial errors. 

552 In our example, the turn-on is so sharp that, expecially for small samples, 

553 is is very likely that there is no single bin at intermediate efficiency, which 

554 makes the fit fail in most cases with the option "ME", though in less cases 

555 with the option "LL". Even with very large samples, the passage from zero 

556 to full efficiency happens in a couple of bins only, so that this is admittedly 

557 a quite pathological situation. 

558 Having no bin at intermediate efficiencies is a problem also for the other 

559 fitting techniques, though TBinomialEfficiencyFitter seems to suffer less about 

560 it. Its clustering around the true values is acceptable and similar to the fit 

561 done with Paterno's HPD credible intervals (figure [7]), but the latter fails 

562 many more times. 

563 Maximizing the log-likelihood instead of minimizing the chi-square helps 

564 reducing the number of failing fits also when using the standard deviations 

565 computed with the Jeffreys prior, though the clustering around the true 
see values does not improve significantly. 

567 In conclusion, TBinomialEfficiencyFitter is the most robust way of fitting 

sea the efficiency, though it appears to underestimate the width of the turn-on 

569 curve. If the samples are large enough that it is safe to use the Gaussian 

570 approximation, then it is better to use the option "LL" when fitting the 

571 histogram with binomial errors. With the same option, the clustering around 

572 the true values is almost the same for the binomial errors and for the reference 

573 standard deviations. However, the latter are to be preferred if the default 

574 (chi-square) fitting method is used. Finally, though it is possible to directly fit 

575 the graph showing Paterno's HPD credible intervals, we recommend to plot 

576 them (they are the only asymmetric efficiency intervals already available in 

577 ROOT, though one could implement other things^]) superimposed with the 

578 function resulting from a fit performed in one of the other methods, with the 

579 only exclusion of the chi-square fit performed with binomial errors, which is 



8 For coherence, it would be best to couple TBinomialEfficiencyFitter with the 95% 
confidence intervals defined on chapter 32 of [2]. 
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580 the worst. 



581 5. Summary 

582 Estimating the selection efficiency is a fundamental task in most data 

583 analyses, based on simulated and/or real data. The measured relative fre- 

584 quency provides the best estimate of the true efficiency in the frequentist 

585 approach and coincides with the posterior mode obtained in the Bayesian 

586 treatment with uniform prior. However, such prior cannot be considered 

587 non-informative. Instead, if we are completely uncertain about the efficiency 

588 before making the experiment, the use of the Jeffreys prior is recommended. 

589 In general, if some prior knowledge is available, it is recommended to en- 

590 code it into a function belonging to the family of Beta distributions (whose 

591 parameters can be determined with the "method of moments" if an approx- 

592 imation is needed). This ensures that also the posterior belongs to the same 



607 
608 



593 family, so that all properties summarized in appendix A. 2 are immediately 

594 available. An important examples of the use of informative priors (written 

595 as Beta densities) is the conbination of independent samples: it is the correct 

596 way to combine independent measurements and the best way for including 

597 prior knowledge coming from simulations. 

598 The knowledge of the uncertainty on the efficiency is needed when scaling 

599 observed quantities to estimate their original values (e.g. the true rate). In 

600 this case, the recommended approach is to use the mean and variance of the 

601 posterior PDF in the computation, whenever the use of the full posterior is 

602 not practical. The usual variance algebra holds, with the caveat that the 

603 square root of the final variance might not be good to define a symmetric 

604 credible interval, because of the inherent asymmetry of the posterior in the 

605 general case. Though in many applications the posterior will be significantly 

606 peaked around the true value, so that the Gaussian (symmetric) approxi- 
mation holds, care needs to be taken when handling very low or very high 
efficiencies, and when the number of events is relatively small. 

609 When communicating the result of an efficiency measurement, the rec- 

eio ommended approach is to provide the 4 Beta parameters corresponding to 

en the posterior and prior PDFs, so that the user will be able to test the effect 

612 of different priors and to combine the posterior with other independent mea- 

613 surements. In the plots, the observed frequency should be accompanied by 

614 asymmetric error bars. Paterno's HPD credible intervals are already available 
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615 in ROOT via TGraphAsymmErrors::BayesDivide() but are not necessarily 



616 the best choice: reference credible intervals would be a better option. 

617 When fitting the efficiency with a theoretical function, few different meth- 

618 ods are available in ROOT. As it is shown in section |4.5[ it is important to 

619 avoid performing a chi-square fit using binomial errors (which unfortunately 
seems to be the easiest choice). The fit performed with the reference stan- 
dard deviations is recommended whenever it converges, and it is suggested 

622 to try using the option "ME" first, and switch to "LL" in case of failure. 

623 Finally, special care must be used when handling samples that do not 
have unit weights or are not independent. Few recipes to deal with the most 
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625 common use cases in particle physics have been sketched in the last section. 

626 A. Useful relations 

627 This appendix summarizes mathematical definitions and properties that 

628 are useful to deal with binomial processes. They can be found in standard 

629 books like 

630 A.l. Gamma function 

The Gamma function is defined on the complex plane (z £ C): 

roc 

T(z) = / t*-V-*dt (14) 
Jo 

631 with T(z + 1) = zY{z). For integer values, T(n) = (n — 1)!. 

632 A. 2. Beta distribution 

The Euler Beta function is a symmetric function of a, b G R: 

B(a,b) = C t a -\l - tf- 1 dt = y F ^ = B(b, a) (15) 



o 



T(a + b) 



and the incomplete Beta function is 

B x (a,b)= [ ^(1 -tf^dt . (16) 
Jo 

633 with x G [0, 1]. 
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For x G [0, 1], the Beta distribution is 

f(x; a, b) = — h ^ a ~ l {l - xf- 1 = Be{x; a, b) (17) 
and its cumulative distribution function is 

F(x; a, b) = f f(t; a, b) dt = §^ = I x (a, b) (18) 
Jo B{a,b) 

634 where I x (a,b) = 1 — Ii^ x (a,b) is the regularized incomplete Beta function. 

635 The mean E, mode m and variance V of the Beta distribution (17) are 

E(x;a,b) = (19) 
m ( X ;a,b) = -i=i^ (20) 

= m^m (21) 

In case one needs to find the PDF of some function of two random vari- 
ables following the Beta density, it might be useful to play with the charac- 
teristic function instead: 

<p(t) = [ Be{x]a,b)exp(-2-Kixt)dx = ^(a; a + b; it) (22) 
Jo 

636 where i-F\(a; 6; c) is the confluent hypergeometric function of the first kind. 

637 A. 3. Continuous extension of the binomial distribution 

From the relation 



□ = [in + 1) B(k + 1, n - k + I)]' 1 

and the definition (17) one finds the relation 

(n + l)P(k\n,p) = f(p;k + l,n-k + l) (23) 

which can be used to extend the binomial distribution and its cumulative 
distribution function to the continuous limit (fc £ 1): 

F(k\n, p) = P(X <k) = h_ p (n - k, k + 1) . (24) 
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By deriving (24) with respect to k one may also obtain a PDF that 
generalizes the binomial distribution: 



ftu \- d r?fu ^ d B p (n-k,k+l) . . 

f(k\n,p) = Yk F{k\n,p) = -Q- k B{n _ k ^ k + 1) ■ (25) 

The biggest problem here is the partial derivative of the incomplete Beta 
function 

d 

—B x (a, b) = B x (a, b) log(x) - x a [T(a)] 2 sF 2 (a, a, 1 — b; a + 1, a + 1; x) 

in which it appears the regularized generalized hypergeometric function 
~ _ . _ p Fg(ai, . . . ,a p ;bi, . . . , b q ; x) 

pi q \ui, . . . , Up, ux, . . . , u q , — ^ ^ . , 

defined in terms of the hypergeometric series 

(ai)n • • • (a P )n ^ n 
(i (6i) n • • • (6,)„ n! 



,F g (ai, . . . , a p ; b x , . . . , b q ;x) = 



638 where (a) n = a(a + l)(a + 2) • - • (a + n — 1) is the the "rising factorial" or 

639 Pochhammer symbol. 

640 A. 4- Posterior density for the difference 

641 We have here two random variables that correspond to the selection ef- 

642 ficiencies e + and e_ for the samples with positive and negative weights con- 



643 sidered in section 4.1 Here we use a uniform prior for both so that their 

644 posteriors, having counted n + and n_ initial events and k + and k- entries 

645 after the selection, are Beta distributions with parameters a + = k + + 1, 

646 b + = n + — k + + 1 and gl = k^ + 1, 6_ = n_ — fc_ + 1. 

647 We want to find the posterior for the difference e = e + — £_ using the 

648 general result found by Pham-Gia et al. [14J. Their expression is valid for 

649 the general difference of two Beta-distributed random variables, with domain 

650 ranging from —1 to +1. However, we know that the physical efficiency can not 

651 be negative, hence we restrict the posterior to [0, 1] (the normalization needs 

652 to be recomputed). The correspondence between their and our notation, 

653 when dealing with the posterior under the assumption of uniform priors for 

654 e + and e_, is: ai = l + k + , ct<i = 1 + fi\ = 1 + n + — k + , 02 = 1 + n- — k_. 
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Their result (equations (2a) and (2c) in [Tl] ) can be rewritten in our case 
(being a\ > 1) in a more compact form, which makes use of the third Appell 
hypergeometric function 

F 3 (a,b,c,d; e;x,y) = 
(a) m (b) n (c) m (d) n x m y n (26) 



CXD CXD 

EE 



obtaining an expression valid for < e < 1 (to be renormalized): 

/(e) <x 



f(e;a 2 ,/3 1 ) ^aa+ft-i 



/(£;«!, /3i)/(e;a 2 ,/3 2 ) v (27) 
*F 3 (Pi, a 2 , 1 - ati, 1 - f3 2 ; a 2 + ft; 1 - e, 1 - e) . 



655 B. Intrinsic credible regions 

656 An possible choice for a credible interval is the "lowest posterior loss" or 

657 LPL credible region, which depends on the definition of the "loss function" 

658 which specifies the loss to be suffered if a particular value for the efficiency 

659 is used in place of the true value. Here we summarize the treatment by 

660 Bernardo [10]. To obtain an invariant LPL credible region, the loss function 
eel should depend on the full PDF, not on the value of a parameter. For example, 

662 the common choice of the quadratic distance I = (eq — e) 2 does not lead to 

663 an invariant solutions. 

664 Let us consider a model with probability distribution p(x\9) for the ob- 

665 servations x G X C W 1 , n > 1, dependent on the parameters 9 G C M m , 
eee 1 < m < n. An "intrinsic loss function" is a symmetric non-negative func- 
667 tion £(9q,9) which is zero if and only if p(x\9 ) = p(x\9) almost everywhere 
ees in X. An intrinsic loss function is invariant under reparametrization but 

669 not, in general, under a one-to-one transformation of the observations x. Be- 

670 cause this is a very desirable property, we restrict ourselves to the intrinsic 

671 loss functions which are also invariant under one-to-one transformations of 

672 x. An example from this class is the the L\ norm, that is the integral of the 

673 absolute value for the difference between two distributions, computed at the 

674 same point x, over the whole support X. When applied to the binomial case, 
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675 the Ci norm gives the invariant expected loss 



£i{e \k,n} = [ £x(e , e) Be{e\k + §, n - k + \) de (28) 
Jo 

n 

^(e ,£) = ^2\Bi{k\e Q ,n) -Bi(k\e,n) | (29) 

fc=0 

676 independent from one-to-one transformations of e. One can now build a LPL 

677 g-credible region by finding the interval [£iow,£ U p] C [0,1] which minimizes 

678 (28) under the constraint J^ p Be(e\k + — k + \) de = q. 

679 The behaviour of many important limiting processes in probability theory 

680 and statistical inference is better described in terms of another measure of di- 

681 vergence, related to the information theory, the intrinsic discrepancy, defined 

682 in terms of the Kullback-Leibler "directed divergence" ft{pi,p 2 } between two 

683 PDFspi,^ 

b{Pi,P2\ = mm {K{p 1 ,p 2 },K,{p 2 ,p 1 }} (30) 

K{Pi,Pj} = / Pi(x)\og^-—dx (31) 
Jv Pj\ x ) 

The intrinsic discrepancy is symmetric, non-negative, defined for strictly 
nested supports, invariant under one-to-one transformations, and additive 
for independent observations. It may be viewed as the minimum expected 
log-likelihood ratio in favour of the model which generates the data (the 
"true" model, which is assumed to be described either by pi or p 2 ) and can 
be used to defined the intrinsic discrepancy loss 

5 x {9 ,9} = 5{p(x\9 ),p(x\9)} (32) 

684 where 9 is the parameter in which we are interested. For the binomial model, 

685 the intrinsic discrepancy loss is 

5 k {e ,e\n} = n5 x {e ,e} (33) 
S x {e ,e} = mm{K{e \e},K{e\e }} (34) 

K{£i\Sj} = Ej log - 1 - + (1 - Ej) log _ 3 (35) 
E{ 1 Ei 

ese where ^{^o;^} is the intrinsic discrepancy between Bernoulli random vari- 
687 ables with parameters Eo and e. 
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Finally, intrinsic credible regions are defined as the lowest posterior loss 
credible regions which correspond to the use of the intrinsic discrepancy 
loss function together with the reference posterior. The reference posterior 
expected loss from using e rather than e in the binomial model is 

d{9 \k,n}=n / 5 x {eo, e}Be(e\k + |, n — k + |) de (36) 
Jo 

and the intrinsic g-credible interval is the interval [£i OW) £up] C [0,1] which 
minimizes the loss (36) under the constraint J £ ^ up Be(e|/c+|, n — k+^)de = q. 

Approximate expressions for the intrinsic credible intervals are based on 
the asymptotic expressions obtained in theorem 4.1 of Bernardo [TO] , In 
particular, they are built upon the "reference parametrization" = <p(9) of 
the parameters 9 of interest, for which the reference prior is uniform. For 
the binomial model there is a single parameter e and = 2 arcsin \/e, i.e. 
£ = sin 2 (0/2). Using a shorter notation for the reference posterior mean 
H = E(e) = (k + 1/2) /(n + 1) and variance a 2 = V(e) = /x(l - fi)/(n + 2) of 
the parameter of interest, the variance of the reference parametrization is 

1 

n + 2 

while its mean is 
1 



ol « = (37) 



H « 0(/i) + 2 a 



k + 1/2 (k + 1/2 - k+ l/2)- 1 / 2 (2fc - n) 

2 arcsin a / + - — K —— i - 

n + 1 4(n + 2) 



(38) 



where 0' and 0" denote the first and second derivative with respect to e. Fi- 
nally, the asymptotic intrinsic g-credible interval in the reference parametriza- 
tion is [0_,0 + ] where 

<f>± « {if ± z g o> = n+ ± -^=== (39) 

ego where z g is the (g + l)/2 quantile of the normal distribution. The intrinsic 
69i g-credible interval for the efficiency is obtained by transforming back to e± = 
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